Efficient high-resolution refinement in cryo-EM with stochastic gradient descent

Electron cryomicroscopy (cryo-EM) is an imaging technique widely used in structural biology to determine the three-dimensional structure of biological molecules from noisy two-dimensional projections with unknown orientations. As the typical pipeline involves processing large amounts of data, efficient algorithms are crucial for fast and reliable results. The stochastic gradient descent (SGD) algorithm has been used to improve the speed of ab initio reconstruction, which results in a first, low-resolution estimation of the volume representing the molecule of interest, but has yet to be applied successfully in the high-resolution regime, where expectation-maximization algorithms achieve state-of-the-art results, at a high computational cost. In this article, we investigate the conditioning of the optimization problem and show that the large condition number prevents the successful application of gradient descent-based methods at high resolution. Our results include a theoretical analysis of the condition number of the optimization problem in a simplified setting where the individual projection directions are known, an algorithm based on computing a diagonal preconditioner using Hutchinson’s diagonal estimator, and numerical experiments showing the improvement in the convergence speed when using the estimated preconditioner with SGD. The preconditioned SGD approach can potentially enable a simple and unified approach to ab initio reconstruction and high-resolution refinement with faster convergence speed and higher flexibility, and our results are a promising step in this direction.


Introduction
We consider the problem of gradient-based optimization for tomographic reconstruction with a particular focus on electron cryomicroscopy (cryo-EM).Stochastic gradient descent (SGD) based optimization methods have become a standard algorithmic framework in many areas due to their speed, robustness, flexibility and ease of implementation, particularly with the availability of fast and robust automatic differentiation libraries.However, in cryo-EM, SGD methods have only really found use in ab initio structure determination, where the goal is only a low-resolution structure.Highresolution structures are then determined by switching to a different form of optimization, typically some form of expectation-maximization (EM), despite the fact that these methods are often slow and require specific modeling assumptions.A natural question which we address here then is why stochastic gradient optimization techniques have not been able to solve the high-resolution optimization problem.Doing so could further speed up data processing for cryo-EM, simplify workflows and unify open research questions.However, perhaps more significantly, using SGD-based optimization methods would allow for more flexibility in modeling the reconstruction problem.Common modeling assumptions (e.g., Gaussian noise, Gaussian priors, rigid structures, discrete Fourier-based structure representations) which may not be optimal but are required by current refinement methods could be relaxed with SGD methods.
In this paper, we show that standard SGD-based methods struggle to accurately determine high-resolution information in cryo-EM due to fundamental properties of the induced optimization problem.We perform a theoretical analysis, in a simplified setting, which shows that the induced optimization problem can suffer from ill-conditioning, which results in arbitrarily slow convergence for standard SGD algorithms.Moreover, our analysis shows that the conditioning may be acceptable at low resolution but becomes worse as resolution increases, explaining why SGD has been successful in the ab initio setting but has yet to be successfully used for high-resolution refinement.While our analysis is in a simplified setting, we argue and empirically verify that this poor conditioning behavior continues to exist in real-world cryo-EM settings.Finally, based on our analysis, we propose a new SGD-based algorithm which, unlike standard methods, exploits higher-order derivatives to improve the conditioning of the problem.Our results demonstrate that this new method is able to overcome the conditioning problem and efficiently and accurately determine high-frequency information.

Background
Cryo-EM enables biologists to analyze the structure of macromolecules in their native state.In comparison with the older method of X-ray crystallography, cryo-EM does not require crystallized samples, allowing one to study larger molecules with complex structure and conformations.Indeed, its potential to uncover the structure and function of macromolecules has been acknowledged by the scientific community: cryo-EM was named the "Method of the year" in 2015 by Nature Methods [1], and its development was the subject of the 2017 Nobel Prize in Chemistry.
The standard cryo-EM single particle analysis (SPA) pipeline involves freezing a biological sample in a thin layer of ice and imaging it with an electron microscope.The imaged sample contains multiple copies of a macromolecule captured in distinct, random and unknown orientations.Following a number of data processing steps, two-dimensional projections of the electrostatic potential of the macromolecule are captured in a stack of images, which we refer to as the particle images.The goal of the cryo-EM SPA pipeline is to reconstruct a three-dimensional volume representing the structure of a molecule from the collected particle images.
In addition to the random orientations, each projection of the volume is shifted off the center of the image by a small, unknown amount, and the particle images are further blurred by a contrast transfer function (CTF) which is image-specific and depends on the settings of the microscope.Moreover, to avoid the damaging of the biological sample by the electron beam, the imaging is done at a low dosage, which results in a poor signal-to-noise ratio (SNR).Therefore, cryo-EM reconstruction requires solving a tomography problem with unknown viewing directions and in-plane shifts, and low SNR.Here we refer to the particle orientations and the in-plane shifts collectively as the pose variables.
Cryo-EM reconstruction approaches implemented in established software packages [2,3] consist of two separate stages: ab initio reconstruction and high-resolution 3D refinement.Ab initio reconstruction provides an initial, low-resolution estimation of the volume.This is a non-convex problem for which many methods have been developed, but the current state-of-the-art methods are based on stochastic gradient descent algorithms which were first used in the context of cryo-EM in the cryoSPARC software [3,4].More recently, a similar approach has been adopted by other software packages such as RELION [5].After the ab initio step, high-resolution 3D refinement performs further optimization of the volume and generates a high-resolution reconstruction.This usually employs an optimization algorithm such as expectation-maximization [6] to iteratively reconstruct the volume and a search procedure to estimate the pose variables.The EM algorithm has become a standard approach to high-resolution refinement [2,3,7,8] and achieves state of the art reconstructions.However, EM-based methods are computationally expensive, generally requiring full passes through all images at each iteration of refinement and necessitating sophisticated grid and tree search algorithms to reduce the computational costs.Further, they are restricted to several key assumptions including Gaussian noise, a Gaussian prior on structures, a rigid structure, and a discrete Fourier-based representation of the structure.
Recently, a new class of methods for cryo-EM have emerged which aim to reconstruct not just static structures but also conformational heterogeneity, where the reconstructed volume is subject to different kinds of deformations [9][10][11][12][13][14][15][16][17][18][19][20][21][22][23][24][25].Such methods greatly expand the capabilities of cryo-EM, e.g., with time-resolved cyro-EM [26,27].However, existing methods usually require a high resolution structure and known pose variables as input, limiting their applicability.Moreover, they cannot use the standard EM-based optimization approaches, often using neural networks trained using SGD instead, and generally fail to match existing rigid structure refinement approaches in resolution and requiring substantially more computation.While motivated by the promise of capturing conformational heterogeneity, we focus here on the static reconstruction case.Our results suggest that improving the performance of SGD-based methods may be the key to unlocking this new capability of cryo-EM.

Prior work
The Bayesian formulation of the cryo-EM reconstruction problem and its solution via the EM algorithm has a long history, with early works including [7,28,29] as well as their implementation for high-resolution 3D refinement in state-of-the-art software such as RELION [2,8] and cryoSPARC [3].While early ab initio reconstruction methods involved heuristic approaches such as using a low-pass filtered known reconstruction of a similar structure to the one of interest, mathematically rigorous approaches based on the method of common lines have been developed in [30][31][32].The SGD algorithm introduced in a cryo-EM context in [3,4] showed improved robustness and speed in obtaining ab initio reconstructions from scratch.More recently, the VDAM algorithm, a gradient-based algorithm with adaptive learning rate has been introduced in the latest version of the RELION software [5].This brief list of cryo-EM reconstruction algorithms is non-exhaustive and we refer the reader to more comprehensive review articles such as [33,34].
The aforementioned articles view the SGD algorithm and its variants as tools for the ab initio step, while the best results for high-resolution refinement are achieved using the EM algorithm.In this work, we present an analysis of the conditioning of the reconstruction optimization problem and propose a method to improve the convergence of SGD for high-resolution refinement by on-the-fly estimation of a suitable preconditioner.While basic preconditioners are used in cryoSPARC [3] and several strategies for adapting the step size based on second order information are implemented in the VDAM algorithm [5], neither work addresses the conditioning of the problem explicitly, and the preconditioners used are not suitable in the high-resolution regime.In contrast, while we theoretically analyze the reconstruction problem in a simplified setting, our proposed solution is specifically designed to overcome the conditioning issue in a big data, high-resolution setting.We leverage ideas from the machine learning literature [35,36] to estimate a preconditioner that poses no significant additional computational cost over the ordinary SGD algorithm, does not require a particular initialization or warm start, and is straightforward to incorporate into existing SGD implementations in cryo-EM frameworks.Finally, we provide numerical experiments that show the feasibility of our preconditioning approach for estimating high resolution information.

Outline
The remaining parts of the article are structured as follows.In Section 2, we describe the mathematical setting of the cryo-EM reconstruction problem, as well as current approaches for high-resolution refinement using the EM algorithm and ab initio reconstruction using the SGD algorithm.Section 3 presents the main contributions.In Section 3.1, we analyze the condition number of the linear inverse problem in the simplified setting where the pose variables are known, while in Sections 3.2-3.4we describe several ideas that are part of a proposed construction of a preconditioner that allows SGD to overcome the conditioning issue.In Section 4, we provide numerical experiments that validate the theoretical contributions, and in Section 5 we conclude with a summary of the main advantages of the proposed approach and motivate a potential extension of this work to the fully general setting.

Forward model
The objective of cryo-EM reconstruction is to estimate a three-dimensional volume representing the shape of a molecule v from a stack of particle images {x i } N i=1 .A simple and frequently used model of the image formation process consists of the following steps: each particle image x i is formed by rotating the volume v by a three-dimensional rotation operator R i , projecting it along the z-axis, applying a small in-plane shift T i , convolving the result with a contrast transfer function (CTF) C i , and adding Gaussian noise.
This model is often formulated in the Fourier domain to speed up the computation of the projections by taking advantage of the Fourier slice theorem and the fast Fourier transform (FFT).The Fourier slice theorem states that the two-dimensional Fourier transform of a projection of a three-dimensional volume is the intersection of the three-dimensional Fourier transform of the volume with a plane passing through the origin of the coordinate axes, where the projection direction is determined by the orientation of the plane.
Let v ∈ C Mv be the (discretized) three-dimensional volume and x i ∈ C Mx , for i = 1, . . ., N , the particle images.Here, M v is the total number of voxels in a M × M × M grid discretization of the volume and, similarly, M x is the total number of pixels in a M × M grid discretization of the particle images1 .Without loss of generality, we assume that in this vectorized representation of the volume, the first M x = M × M elements of the volume correspond to the volume slice at z = 0 in the Fourier domain.Moreover, note that in this representation, both the CTF C i and the in-plane shift operator T i are diagonal matrices in C Mx×Mx , since in the Fourier domain, the convolution corresponds to element-wise multiplication and the in-plane shift corresponds to a phase factor.Since the volume and images are defined on discrete grids, the rotated volume and the initial grid coordinates are no longer aligned.Specifically, the value of a volume v acted on by a rotation operator R at the grid coordinates r is given by evaluating the volume v at the rotated coordinates R T r: (Rv)(r) = v(R T r).However, due to the possible misalignment between the coordinate grid that v is defined on and the rotated coordinates R T r, the value of v at R T r must be interpolated using its values at the neighboring grid points.
We define two projection operators that use nearest-neighbor and trilinear interpolation, respectively.In short, projection by nearest-neighbor interpolation assigns to the rotated grid point the value of the volume at the grid point closest to the rotated grid point, while projection by trilinear interpolation performs linear interpolation using the closest eight grid points to the rotated grid point and assigns the resulting value to it, as follows: Definition 2.1 (Projection operator P ϕ ).Let ϕ i ∈ SO(3)×C 2 denote the pose variable encapsulating the rotation matrix R and the diagonal shift matrix T , and {r j } Mx j=1 be the coordinates of the Fourier grid at z = 0. We define the following projection operators P ϕ ∈ C Mx×Mv : 1. Nearest-neighbor interpolation projection operator P nn ϕ : where k * (j) = argmin k∈{1,...,Mv} ∥r k − R T r j ∥ 2 , for all j = 1, . . ., M x .In this case, the matrix P nn ϕ has exactly one non-zero element in each row, and therefore (P nn ϕ ) * P nn ϕ ∈ R Mv×Mv is diagonal.

Trilinear interpolation projection operator
where v * j is obtained by using trilinear interpolation of v at R T r j , for all j = 1, . . ., M x .As the nearest-neighbor interpolation operator is a diagonal matrix, we will use it to establish theoretical results regarding the conditioning of the reconstruction problem.However, the trilinear interpolation operator is more common in practice, and therefore we will show in numerical experiments that our preconditioner estimation method applies to this case as well.
Remark 2.2.Since the nearest-neighbor projection operator P nn ϕ corresponds to a plane slice through the volume passing through the center of the coordinate axes that is approximated by the closest grid points to the plane, it follows from (1) that the indices of the non-zero elements in the diagonal matrix (P nn ϕ ) * P nn ϕ ∈ R Mv×Mv are the indices of these nearest-neighbor elements in the vectorized representation of the volume v ∈ C Mv .Moreover, the non-zero (diagonal) elements of (P nn ϕ ) * P nn ϕ are real and positive, as the shift matrix T is complex diagonal with elements of unit absolute value.
3. An alternative approach to the interpolation given in Definition 2.1 is to sample the volume on the rotated grid using non-uniform FFT [37][38][39] as done, for example, in [40].In [41], the structure of the matrix involving a projection and a backprojection is leveraged to obtain a fast preconditioner based on a circular convolution [42].
Given the projection operators defined above, we can state the forward model as: where P ϕi is one of the two projection operators in Definition 2.1, C i ∈ C Mx×Mx is a diagonal matrix corresponding to the CTF, and η i is the noise vector in the i-th image, with elements complex normally distributed with variance σ 2 .Both the rotation R i and the shift T i are specific to each image x i and not known.The CTF model is the same across all images, with image-specific parameters (e.g., defocus).We assume that the CTF model is known and its parameters have been estimated in advance, and that the noise variance σ 2 is constant across particle images and pixels and has also been estimated in advance.

Bayesian formulation and the EM algorithm
The standard approach to high-resolution 3D refinement in cryo-EM is to solve a Bayesian formulation of the volume reconstruction problem with marginalization over the pose variables.This is solved using the expectation-maximization algorithm [6], which has first been used in the context of cryo-EM for aligning and denoising particle images in [28] and further refined for full volume reconstruction in the software packages RELION [2] and cryoSPARC [3].
Computing the full posterior distribution of the volume v given the particle images {x i } N i=1 is computationally expensive.Instead, the maximum-a-posteriori (MAP) problem involves computing the volume v that maximizes the (log) posterior: where p(v) is the prior distribution of the volume v and p(x i |v) is the likelihood probability of observing the image x i given the volume v: where p(ϕ i ) is the prior distribution of the pose variable ϕ i of the particle image x i and p( is the likelihood of observing image x i given the volume v and pose ϕ i : The volume prior p(v) plays a regularization role in solving the reconstruction problem and, throughout this manuscript, we assume it is Gaussian with variance τ 2 for all elements of the volume: The EM algorithm finds the optimal volume v * in (4) by alternating between two steps at each iteration.The fist step computes the expected value of the posterior of the pose variables given the current estimate of the volume v (k) (the E step), while the second step computes a new iterate for the volume v (k+1) that maximizes this quantity (the M step).If the prior is Gaussian, the v (k+1) can be computed analytically by letting the derivatives be equal to zero.The resulting algorithm performs the following update at iteration k: M step: where In more general implementations of the EM algorithm in software packages such as RELION and cryoSPARC, the noise variance σ 2 and the prior variance τ 2 are not constant across image pixels and volume voxels respectively, and are both estimated at each iteration using the current volume iterate v (k) and weights Since the optimization landscape is non-convex, the EM algorithm can converge to a local maximum different from the MAP estimator [6], so a good initialization is required.Therefore, it is best used for high-resolution refinement, where an initial volume and possibly priors for the pose variables are provided by the ab initio algorithm.However, the E step is particularly expensive at high resolution, as it requires integrating over the entire space of the pose variable ϕ i for each particle image in the dataset.In practice, this is performed by employing efficient gridding and search techniques.

Stochastic gradient descent for ab initio reconstruction
The SGD algorithm has become the preferred method for solving the MAP problem (4) in the context of ab initio reconstruction.It was first used in [3] to obtain an initial volume reliably and without requiring a good initialization, as it is the case for EM.The key property that SGD exploits is that the objective function in (4) can be split into separate terms for each particle image, not unlike the loss functions used in the training of deep neural networks.In particular, the objective function (4) can be written as: where each term f i corresponds to a particle image: At iteration k, SGD performs the following update: where η k is the step size at iteration k and the index j of the term (and particle image) used to compute the gradient is chosen uniformly at random.In practice, a mini-batch of particle images, rather than a single image, is used to compute the gradient at each iteration.The volume update step (8) for the current mini-batch is performed after a refinement step of the pose distribution of each particle image in the current mini-batch given the current volume and data p(ϕ i |v (k) , x 1 , . . ., x N ), for example by using (6a) similarly to the EM algorithm.Various techniques can be used to speed up the computation time, for example the grid refinement implemented in RELION [2] or the branch-and-bound approach in cryoSPARC [3].
The SGD algorithm has two major advantages over EM.First, it has a lower computational cost per iteration, as it only uses a subset of the dataset, while EM requires a pass through the entire dataset.While computing the integrals in (5) for the images in the mini-batch cannot be avoided, the gridding and searching approaches that EM uses to efficiently sample the space of poses are also beneficial in the implementation of SGD.Second, it has been observed in practice that the noise in the sampled gradient enables SGD to explore the optimization landscape more efficiently before converging to a local minimizer of the objective function, preventing it from becoming stuck in unwanted local minima.Given these advantages, the SGD algorithm is particularly well-suited for ab initio reconstruction, as it has been shown in practice in cryoSPARC [3].More recently, the gradient-based algorithm VDAM has been introduced for ab initio reconstruction in the RELION software [5].However, despite the clear benefits in terms of computational cost and convergence speed of stochastic gradient algorithms compared to EM for (low resolution) ab initio reconstruction, EM is still the state-of-the-art approach for high-resolution refinement.
To investigate how the performance of SGD can be improved for high-resolution cryo-EM reconstruction, we first state a convergence result from [43]: Theorem 2.4 (Convergence of the stochastic gradient descent algorithm).Suppose that f : R M → R in (7) is twice-differentiable and strongly convex and its gradient is Lipschitz-continuous with constant L. Furthermore, we assume that there exists B ≥ 1 such that: Let v * be the minimizer of f and v (k) the iterate generated by the SGD iteration (8) with fixed step size η k = 1 LB 2 .Then: The main implication of Theorem 2.4 is that the rate of convergence of SGD is affected negatively by a large condition number of the Hessian matrix κ(∇ 2 f (v * )).In particular, we can derive that, to reach objective function error ϵ = E f (v (k) ) − f (v * ), at least k * iterations are required, where: Note that we chose this particular result for its simplicity, despite the rather strong condition (9).More general results for the convergence of SGD where this condition is relaxed can be found, for example, in [44][45][46].
To improve the convergence of SGD, we will precondition it using an approximation of the diagonal of the Hessian matrix, which we will compute using Hutchinson's diagonal estimator [47]: where z ∈ R Mv is a vector with elements drawn from a Rademacher(0.5)distribution (the entries of z are 1 or −1 with equal probability) and ⊙ denotes element-wise multiplication.
The central claim of this article is that the condition number of the Hessian of the loss function in the cryo-EM reconstruction problem scales unfavorably with the target resolution of the reconstruction, and therefore it affects the convergence of SGD for high resolution refinement.In the next section, we argue that this is indeed the case and we propose a solution for overcoming this issue in a simplified setting.

SGD for high-resolution refinement: fixed pose variables
In this section, we study the optimization problem (4) in the setting where the pose variables (the three-dimensional orientations and the in-plane shifts of the particles) are known.While this is a simpler problem that can be solved with other methods, it captures the main difficulty that makes the application of gradient-based algorithms non-trivial at high resolution, namely the large condition number of the Hessian.Therefore, analyzing the reconstruction problem with the known pose variables provides useful insights and directions for approaching the problem in its full generality.
For simplicity and without loss of generality, we assume, like in the previous section, that the variance of the noise σ 2 and the variance of the prior τ 2 are fixed and given in advance, and are constant across all images and pixels (in the case of σ 2 ) and across all voxels of the volume (in the case of τ 2 ).The analysis presented in this section can be generalized to the case where σ and τ are not constant, and the preconditioner estimation we propose has the flexibility to incorporate existing methods for determining these parameters in the reconstruction process.To simplify the notation, we collect these two parameters in one parameter λ = σ 2 τ 2 , which we will refer to as the regularization parameter.

Condition number
Having access to the true pose variable ϕ * i for each image x i is equivalent to taking the prior distribution for the pose variable ϕ i in ( 5) to be a Dirac delta distribution centered at the true value ϕ * i , namely p(ϕ i ) = δ ϕ * i .In addition, we assume for simplicity that the images are not deformed by the CTF: C i = I Mx , for all i = 1, . . ., N .With the regularization parameter λ described above and letting P i := P ϕ * i , we write the optimization problem (4) as: Letting f (v) be the objective function in (12) and f i (v) the i-th term: the optimization problem (12) becomes: The minimizer of ( 13) is the point v * ∈ C Mv that satisfies: where b = N i=1 P * i x i and H ∈ R Mv×Mv with: Note that, for the problem (12), H is the Hessian of the objective function.The SGD algorithm solves problem (13) by iteratively taking steps in the direction of negative sampled gradient: where η k is the step size at iteration k and the index j is selected uniformly at random.Its convergence properties are determined by the condition number of the matrix H, as stated in Theorem 2.4.
When the projection matrices P i are the nearest-neighbor interpolation matrices (1) in Definition 2.1, the matrices P * i P i are diagonal with real non-negative elements (see Remark 2.2), thus the Hessian matrix H is also diagonal with real non-negative elements.In this case, its condition number [48] is We will now analyze the structure and the condition number of the Hessian matrix H when the projection matrices correspond to nearest-neighbor interpolation, namely P i = P nn ϕ * i , for i = 1, . . ., N .In order to do so, we introduce two necessary concepts.First, the projection assignment function of a particle image maps each element of the image to the element of the volume whose value is assigned to it by the projection operator.Definition 3.1 (Projection assignment function).Let P i := P nn ϕi ∈ C Mx×Mv , i = 1, . . ., N , be nearest-neighbor interpolation projection matrices given in Definition 2.1 and corresponding to the pose variables ϕ i , i = 1, . . ., N .We define the projection assignment function Λ i : {1, 2, . . ., M x } → {1, 2, . . ., M v } as the function that maps each pixel index k of the i-th particle image to the voxel index Λ i [k] in the volume v whose value is assigned by the operator P i at index k.Namely, we have that: where the square brackets notation is used for the value of the image (in the left-hand side) or volume (in the right-hand side) at a particular index k, and T i [k] is the k-th element in the diagonal of the translation matrix T i .
Second, the voxel mapping set of a volume element contains the indices of the images that contain a projection of that volume element.Definition 3.2 (Voxel mapping set).For every voxel index j ∈ {1, . . ., M v }, we define the voxel mapping set Ω j ⊆ {1, . . ., N } as the set of indices of images that contain a pixel that is mapped by their projection assignment functions Λ i to j, namely: Given the functions Λ i and the sets Ω j defined above, the diagonal elements of the (diagonal) matrix P * i P i are (P * i P i ) jj = 0 if i / ∈ Ω j and (P * i P i ) jj ∈ {1, 2} otherwise.We make the following assumptions: 1. We assume that each voxel index j is mapped at most once by the projection assignment function of an image Λ i .
2. Without loss of generality, we assume that j = 1 is the index of the voxel corresponding to the center of the coordinate axes.Then, the voxel at j = 1 is mapped by all projection operators P i , i = 1, . . ., N , or equivalently, Ω 1 = N .
The second assumption above concerns the ordering of the elements in the vectorized representation of the grid, specifically so that the center is mapped to the element at index j = 1, while the first assumption simplifies our analysis (at a cost of a factor of at most two in the condition number bound below) by ensuring that the diagonal elements of P * i P i satisfy: Then, the full Hessian matrix H is also diagonal, with its diagonal elements given by: where |Ω j | is the cardinality of the set Ω j and 0 ≤ |Ω j | ≤ N , for all j = 1, . . ., M v .Finally, Proposition 3.4 below captures the main difficulty of the reconstruction problem, namely the condition number of the Hessian matrix increasing with the resolution.Proposition 3.4 (Condition number bound).Let M x = M 2 and M v = M 3 for grid length M and let Assumptions 3.3 hold for the nearest-neighbor interpolation projection matrices P i := P nn ϕi .Then, for any fixed number of images N , we have that: Proof.Since the matrices P i are the nearest-neighbor interpolation matrices P nn ϕi , the matrix H is also diagonal and, according to (16), to compute κ(H) we need to find the largest and smallest elements of H.
To compute min j H jj , note that the projection assignment functions Λ i for i = 1, . . ., N , map N M 2 image pixels to M 3 volume voxels.
For M > N , there are more voxels than total pixels (in all the images), and so there exists a voxel j * such that |Ω j * | = 0.Then, min j H jj = H j * j * = λ, and so κ(H) = N +λ λ .For M ≤ N , there are N M 2 pixels mapped to M 3 voxels, and therefore there exists a voxel j * such that |Ω j * j * | ≤ N M 2 /M 3 = N/M .Then, min j H jj ≤ N/M + λ, and so κ(H) ≥ N +λ N/M +λ .We can also write the bounds in Proposition 3.4 in terms of the radius R in the Fourier domain.If we assume that the number of pixels in a 2D disk of radius R is approximately πR 2 and the number of voxels in a 3D ball of radius R is approximately 4  3 πR 3 then, following a similar argument, we obtain: More generally, if the ratio of the number of pixels in a projected image and the number of voxels in a volume at a given resolution R is p(R), then the bounds in (19a) and (20a) can be written as: While the counting argument above shows that the condition number is large when there are more voxels to reconstruct than pixels in all the particle images, in practice, the condition number grows fast with the resolution due to an additional factor.Specifically, in light of the Fourier slice theorem, each image is used to reconstruct the voxels corresponding to a slice through the volume passing through the center of the coordinate axes.Therefore, the large condition number of the matrix H is also a consequence of the fact that the elements of H corresponding to low-frequency voxels are reconstructed using pixel values in most images, while the elements of H corresponding to high-frequency voxels are "seen" by fewer pixels in the particle images.Each new iteration will provide more information to the low-frequency voxels than to the high-frequency ones (relative to the total number of low-frequency voxels and high-frequency voxels, respectively), which leads to errors being amplified (or corrected) at different rates when solving the inverse problem.Lastly, this problem is exacerbated by preferred orientations of the particles: the orientation angles often do not cover SO(3) evenly in real datasets, causing the Fourier transform of the volume to miss entire slices.
In Figure 1, we illustrate the statement above for the setting of this section, specifically with nearest-neighbor interpolation in the projection operators and no CTF.In panel (a), we show the lower bound on κ(H) given in (21) as a function of the radius in the Fourier space, as well as the condition number for a dataset of N = 10000 particle images with uniformly sampled orientations with R ranging from 1 to 304 voxels at intervals of 16 and λ = 10 −8 .The condition number grows faster than the lower bound due to the effects described in the previous paragraph.To further illustrate the relationship between the number of images and the condition number, we show in panels (b,c) of Figure 1 the Hessian H when using nearest-neighbor interpolation and no CTF when the dataset contains N = 5 images (panel (b)) and when the dataset contains N = 100 images (panel (c)).This shows how the particle images contribute to a larger fraction of the voxels close to zero than those at a large Fourier radius.In light of the dependence of the rate of convergence of SGD on the condition number of the Hessian H given in Theorem 2.4 and equation (10), Figure 1(a) suggests that the number of iterations required to reach a certain error grows exponentially with the resolution.Since the root cause is the large condition number at high resolution, we will address this issue by preconditioning the SGD iterations, specifically by using an approximation of the diagonal D ∈ R Mv×Mv of H: Remark 3.5.For fixed σ, τ , taking λ = σ 2 /τ 2 and known poses, the matrix whose inverse appears in the M step of the EM algorithm in equation (6b) is the full Hessian matrix (15) of the loss function.Therefore, EM implicitly solves the conditioning issue that is problematic for SGD, and in our preconditioning approach, we aim to approximate, using mini-batches, this gradient scaling that EM applies at every iteration.
With the facts above regarding the condition number of the Hessian of the loss function, we now proceed to estimate the diagonal preconditioner for this matrix.

Computing the preconditioner
We aim to obtain an approximation of the diagonal of the Hessian matrix H and use it to precondition SGD, which is equivalent to preconditioning the linear system ( 14) using the Jacobi preconditioner [48].Motivated by algorithms such as AdaHessian and OASIS in the machine learning literature [36,49], we estimate the diagonal of the Hessian using Hutchinson's diagonal estimator [47] stated in (11).
Estimating the diagonal of H using (11) has two practical advantages.First, for any function f , applying (11) only requires computing Hessian-vector products, rather than forming the full Hessian matrix.Indeed, for a function f , a Hessian-vector product is computed efficiently using Jacobian-vector products and automatic differentiation as follows: Second, the computation can be split into mini-batches so that, at each iteration, only the current mini-batch of images is used for the Hessian-vector product computation.
The update of the preconditioner at the current iteration obtained using the current mini-batch is combined with the estimated preconditioner from the previous iteration using an exponential average as done, for example, in Adam [50], AdaHessian [49] and OASIS [36].In addition, to take advantage of the fact that the Hessian H in (15) of the objective function ( 12) is independent of the current iterate when the orientations are known, the exponential average is taken between the estimated preconditioner at the previous iteration and the estimated diagonal using all the samples of Rademacher vectors z drawn up to the current iteration.However, in the more general reconstruction problem with unknown pose variables (4), only the current update would be used.Starting with the identity matrix as the initial estimate, D (0) = I Mv×Mv , the update rule for the diagonal estimator D (k) at iteration k is: where β ∈ (0, 1) and the Hessian-vector product is computed using the current mini-batch.An example of the convergence of this estimate over 100 batches of 1000 images each and a total number of N = 50000 images is shown in Figure 2.
Remark 3.6.When the projection operator uses nearest-neighbor interpolation, i.e.P i := P nn ϕi , and therefore the Hessian matrix H is diagonal, Hutchinson's estimator with Rademacher samples (11) computes the exact diagonal of H using a single sample vector z or, if computed using mini-batches, after a single epoch.

Thresholding of the estimated preconditioner
As we will see in the numerical experiments section, the combination of variance of the sampled gradient using mini-batches and error in the diagonal approximation of the Hessian, especially in the early iterations, can lead to highly amplified errors in the current iterate v (k) .23), where H is the Hessian matrix of the loss function f in (12) with trilinear interpolation (and therefore, H is not diagonal).(a) Relative ℓ 2 error of the estimated diagonal of H using Hutchinson's diagonal estimator (11) and the proposed exponential average ( 23) between an initial estimate D (0) (here the identity matrix) and the average given by Hutchinson's estimator.The larger relative error of the proposed method at the first iterations is due to the error in The individual elements of the estimated diagonal plotted against their true values.The accuracy in the smaller elements is lower, as expected for highfrequency elements not mapped by the projection operators of many particle images, i.e. elements at high-frequency indices j with small |Ω j |.
In general, this problem can be avoided by using variance-reduced stochastic gradient methods [51,52], which require computing full gradients at a subset of the iterations or storing previously computed gradients.However, in a typical cryo-EM setting with large datasets, this has a prohibitive computational cost.
Instead, we propose a simple solution that leverages the particular structure of the cryo-EM reconstruction problem.Given the specific structure of the matrices P i and P * i P i (see Definition 2.1 and equation ( 17)) and the fact that the projection operators perform slicing in the Fourier domain (according to the Fourier slice theorem), the preconditioned SGD iteration (22) updates the elements of v (k) corresponding to high-frequency voxels at a lower rate using information in the particle images compared to the low-frequency voxels.Similarly, the elements of the estimated preconditioner D (k) corresponding to high-frequency voxels in the volume have small values, close to the regularization parameter λ, while the elements corresponding to low-frequency voxels have magnitudes that reflect the large number of images that contribute to the reconstruction of the voxels (see the discussion in the paragraph below equation ( 21)), and are scaled by the CTF at that particular resolution.In light of (18), this is due to |Ω j | being large for low-frequency elements and small for high-frequency elements of the diagonal of H.
This knowledge can be incorporated into the reconstruction algorithm in two ways.One approach is to tune the regularization parameter λ so that it balances the small entries in the preconditioner with the ability to obtain good convergence at high resolution.Alternatively, the same effect can be achieved without interfering with the regularization term by thresholding the smallest elements of the preconditioner, with the benefit of using the structure of the measurement operator in the preconditioner while allowing the freedom to chose the regularization term and parameter by other means, for example as done in RELION [2].Here, we opt for the latter approach, namely thresholding the elements of the preconditioning matrix at the current iteration D (k) below a certain value α, chosen as follows.
Let C i (r) be the radially symmetric CTF corresponding to the i-th image, as a function of the Fourier radius r ∈ [0, R], and let P x (r) and P v (r) be the number of pixels in a 2D Fourier shell at radius r and the number of voxels in a 3D Fourier shell at radius r, respectively.Following the steps in Section 3.1 with non-trivial CTFs and assuming that the particle orientations are uniformly distributed (the details are omitted for brevity), the expected value of an element of the Hessian matrix H at the maximum Fourier radius R is: Setting the threshold of the smallest elements of D (k) to α = H, the final preconditioner at iteration k is defined as:

The algorithm
With the preconditioner estimation approach proposed in Section 3.2 and the thresholding described in Section 3.3, we present the full algorithm in Algorithm 1.No warm start is required for the approximation of the diagonal preconditioner, which is initialized with the identity matrix and estimated iteratively.The step size η (k) at each iteration is set using the stochastic Armijo linesearch [35], namely the largest step size η is sought that satisfies the condition: where I k is the index set of the current mini-batch, D is the preconditioner, and c ∈ (0, 1) is a hyperparameter.Condition (26) ensures that a sufficient decrease in the objective function is attained over the current mini-batch at every iteration Draw index set I k and Rademacher(0.5)vector z (k) 6: 10: while end while 15: end for

Numerical experiments
In this section, we present numerical results that demonstrate the arguments given in this article regarding the convergence of the SGD algorithm for high resolution cryo-EM reconstruction.In particular, the numerical experiments in this section further verify two claims: • Claim 1: The condition number κ(H) of the Hessian of the loss function is large at high resolution, which leads to slow convergence of the SGD algorithm.
• Claim 2: Preconditioning SGD using the diagonal preconditioner as estimated using the tools described in Sections 3.2-3.4leads to improved convergence at high resolution.
The dataset used in this section is derived from the Electron Microscopy Public Image Archive database entry EMPIAR-10076 of the bacterial ribosome, where N = 50000 images of 192 × 192 pixels are selected and whose pose variables have been computed using RELION.The condition number of the reconstruction problem for this dataset with the computed pose variables is shown in Figure 3(a) for increasing values of the maximum Fourier radius R. Nearest-neighbor interpolation has been used for this figure, and therefore the condition number has been computed using (16).The condition number grows quickly with the resolution, and at the maximum radius R corresponding to the grid side length M = 192, it is of the order of 10 7 .Note that the bound in Proposition (3.4) and equations ( 20)-( 21), shown as the solid blue line in Figure 3(a), holds for the condition number when no CTF is included, shown as the red, dashed line.Because the CTF has magnitude less than one at the origin, the numerator in ( 16) is smaller that in the case when no CTF is used, and therefore the condition number computed in this setting, shown in the green dashdot line, is not always larger than the bound and plateaus at a lower value at high resolution.Moreover, in this plot we see the effect of the non-uniform orientations.In particular, above a certain radius (here approximately at 20 voxels from the origin), there are voxels that are not projected to any image pixels (namely their voxel mapping set Ω j has cardinality zero) and therefore the condition number reaches O(1/λ), in contrast to the plot shown in Figure 1(a), where uniform orientations have been simulated.The non-uniform orientations are seen in the cross-section of the diagonal of the Hessian matrix H in Figure 3(b), which also shows the effect of the CTF.
To verify the two claims on the dataset described above, we run three algorithms: SGD with no preconditioner, SGD with a preconditioner that has been precomputed in advance using (23), and Algorithm 1, namely SGD with a preconditioner estimated during the refinement process and thresholded appropriately.All algorithms use the stochastic Armijo line-search (26) for the step size adaptation.While the theory in Section 3.1 applies to nearest-neighbor interpolation, where the matrices P * i P i are diagonal (and therefore the optimization problem can be solved easily by other methods), in the numerical experiments presented in this section, the projection operators are implemented using trilinear interpolation, which is often used in practice.In this case, the Hessian matrix H, whose diagonal we estimate and use as a preconditioner, is no longer diagonal.
To evaluate and compare the performance of the three algorithms, we compute the value of the loss function in (12) after each epoch (a full pass through the dataset), as well as the Fourier Shell Correlation (FSC) of each reconstruction with a ground-truth solution, obtained using L-BFGS with 750 iterations.The FSC, a standard error measure in the cryo-EM literature, is the crosscorrelation coefficient between two volumes across three-dimensional shells in the Fourier domain [53].Specifically, given two volumes u and v in the Fourier space, the FSC at radius r from the origin is defined as where S r is the set of Fourier voxels in the spherical shell at radius r.
Figure 4 shows the results obtained using the three algorithms.In panel (a), we see the convergence of the loss function significantly improved when using the precomputed preconditioner (red) over not using a preconditioner (blue), verifying Claim 1 above.When estimating the preconditioner during refinement using Algorithm 1, the value of the loss (green, solid) is between those of the previous two algorithms and approaches the fully preconditioned SGD as the preconditioner progressively becomes more accurate, verifying Claim 2. In panel (b), we show the FSC between the final reconstructions of each of the three algorithms and the ground truth reconstruction.While at low resolution, the FSC value is high for all algorithms, showing good convergence, at high resolution the FSC degrades for SGD with no preconditioning (blue), while the two preconditioned algorithms have a high value, close to one.
In addition, panels (a) and (b) of Figure 4 also show the results of running Algorithm 1 without the thresholding of the estimated preconditioner described in Section 3.3 (green, dashed), highlighting the importance of this step.The loss initially decreases as expected, however it plateaus after two epochs.A clue for this issue is given in panel (c), showing the high variance of the stochastic gradient at high-frequency over one full epoch for an arbitrarily selected iterate (orange, dashed) and the large magnitude of the radial average of the high-frequency elements of D −1 (blue, dashed), where D is the estimated preconditioner without thresholding.This would not be a problem by itself if such entries in the preconditioner were computed exactly.However, due to the preconditioner not being accurately estimated early in the run of the algorithm, the high-frequency elements of the volume have a large error (due to the low |Ω j | of such high-frequency elements, they are updated using the correct value at a small subset of the individual iterations).Therefore, in the first iterations, the variance in the stochastic gradient amplifies the errors in the estimated preconditioner at high resolution, leading to a large loss in the high-frequency elements.Then, the step size adaptation leads to a small step size and, therefore, to slow overall convergence.This also explains the low value of the FSC when using a preconditioner with no thresholding even in comparison with the non-preconditioned SGD in panel (b) of Figure 4.
For further insight into the difference between SGD without a preconditioner and SGD with a (precomputed) preconditioner at low and high resolution, as well as how the estimated preconditioner behaves in relation with them, we show in Figure 5 the FSC with the ground truth for the three algorithms at specific Fourier shells and across epochs.Panel (a) shows the FSC for a low-frequency Fourier shell, panel (b) shows a medium-frequency shell, and panel (c) shows a high-frequency shell.At low resolution, all three algorithms are almost indistinguishable, while at medium resolution, the non-preconditioned SGD shows slower convergence than the preconditioned SGD, with SGD with the the estimated preconditioner quickly approaching the accuracy of the fully preconditioned SGD.The advantage of using a preconditioner is seen most clearly at high resolution, where the FSC of the non-preconditioned SGD is much lower than the other two algorithms.SGD with an estimated preconditioner converges more slowly than at medium resolution, but eventually approaches the FSC of SGD with a precomputed preconditioner.The small difference in the FSC at the last epoch between the two preconditioned algorithms is likely due to a combination of factors such as the thresholding of the very high-frequency elements in the preconditioner and the limited accuracy of the preconditioner estimate after only 10 epochs.Lastly, we show in Figure 6 a complete overview of the shell-wise FSC for each algorithm using a heat map of the correlation as a function of the Fourier shell number and the epoch number.
Finally, in light of equation (10), the number of iterations or epochs required for a gradient-based algorithm to achieve an accuracy ϵ scales linearly with the condition number of the Hessian matrix H.We show in Figure 7 the impact of the condition number on the number of epochs required by each of the three algorithms to reach a certain FSC value in our experiments, for each Fourier shell.As the condition number of the preconditioned algorithms scales better with the resolution, the plots show the number of iterations growing more slowly with the resolution for the preconditioned algorithms than in the non-preconditioned case.shows the FSC of the final reconstruction with a ground truth for SGD with no preconditioner (blue, solid), SGD with a precomputed preconditioner (red, solid), and Algorithm 1, namely SGD with an estimated and thresholded preconditioner (green, solid).The green dashed line shows the convergence and FSC for SGD with an estimated but not thresholded preconditioner.In (c), the diagonal of the inverse of the (diagonal) preconditioning matrix D is reshaped into a 192 × 192 × 192 volume and averages along radial frequency shells are computed (blue), as well as the variance of the gradient computed over one epoch (orange), shown without thresholding (dashed) and with thresholding (solid).While preconditioning is not required at low resolution, as seen in (a), its effect becomes increasingly apparent at higher resolution.Note that, since in these experiments the projection matrices use trilinear interpolation, the Hessian matrix H is not diagonal, and yet the diagonal preconditioner is very effective.

Outlook and conclusion
In this article, we analyzed the the conditioning of the cryo-EM reconstruction problem with a view towards applying stochastic gradient methods efficiently at high resolution.The proposed preconditioner construction and the numerical experiments performed show promising results for high-resolution 3D refinement.While the analysis and experiments presented hold in the special case when the pose variables are known, this simplified setting captures the main difficulty of applying gradient-based algorithms at high resolution, namely the large condition number of the Hessian of the loss function due to the particular structure of the projection operator in the Fourier domain.This proof-of-concept work shows the potential of the SGD algorithm for the more general cryo-EM reconstruction problem.There are a number of benefits that such an approach would provide: 1.The main advantage is the improved convergence speed.While the EM algorithm requires a full pass through the entire dataset at each iteration, SGD methods only use a mini-batch of the particle images.By estimating a preconditioner during the reconstruction process, the convergence of SGD improves at high resolution, while also benefiting from the speed of using mini-batches.In contrast, EM implicitly computes the same preconditioner, but at an increased computational cost due to requiring the entire dataset at each iteration.
2. A critical component of the current EM approaches is the ℓ 2 regularizer, which makes the maximization step computationally tractable.SGD, on the other hand, is compatible with other regularization methods, and one could take advantage, for example, of the Wilson prior [54] and learned regularization methods [55][56][57].
3. While the numerical experiments presented here illustrate the performance of the estimated preconditioner with a simple SGD implementation, the preconditioner is compatible with existing and more sophisticated stochastic gradient methods used in established cryo-EM software.
Unifying the steps of the ab initio reconstruction and high-resolution refinement using a single algorithm is not only more consistent conceptually, but also a practical improvement, allowing a more streamlined implementation.
We defer to future work the analysis and the preconditioner construction in the general case, where the pose variables are not known.The main difficulty in the general case over the setting of our analysis is that, when marginalizing over the unknown poses (see (5)), the objective function is no longer quadratic.Therefore, the Hessian depends on the current volume iterate.However, it is expected that the pose prior distributions are already narrow at high resolution and do not vary considerably from one epoch to another.Moreover, at each iteration, only a subset of the pose variables are updated (those corresponding to the particle images in the current mini-batch), and existing efficient pose sampling techniques can be used.Therefore, an approach to estimate the preconditioner in the general case based on similar ideas to the ones presented in this article is a promising future direction.
z = 0 slice of H for N = 5 z = 0 slice of H for N = 100

Figure 1 :
Figure 1: The condition number κ(H) for nearest-neighbor interpolation and no CTF.(a) Lower bound (blue, solid) and average value (red, dashed) of κ(H) for N = 10000 images, Fourier radius of R voxels in the range {1, 16, 32, 48, 64, . . ., 304}, and λ = 10 −8 .The lower bound is computed using equation (21), where p(R) is computed empirically using spherical and circular shells of radius R. To compute the average value of κ(H) for each R, 10 sets of N uniformly distributed orientations are generated and the average condition number is plotted, as well as error bars showing the minimum and maximum values of the condition number for these sets of orientations.(b,c) The z = 0 plane of the diagonal of the nearest-neighbor Hessian matrix H reshaped as a 3D volume, for N = 5 and N = 100 images, respectively, and grid size 256.

Figure 2 :
Figure2: Estimating the diagonal of H using(23), where H is the Hessian matrix of the loss function f in(12) with trilinear interpolation (and therefore, H is not diagonal).(a) Relative ℓ 2 error of the estimated diagonal of H using Hutchinson's diagonal estimator(11) and the proposed exponential average(23) between an initial estimate D (0) (here the identity matrix) and the average given by Hutchinson's estimator.The larger relative error of the proposed method at the first iterations is due to the error in D (0) = I Mv×Mv .(b) The individual elements of the estimated diagonal plotted against their true values.The accuracy in the smaller elements is lower, as expected for highfrequency elements not mapped by the projection operators of many particle images, i.e. elements at high-frequency indices j with small |Ω j |.

Figure 3 :
Figure 3: The condition number κ(H) for the experimental dataset used in Section 4, with N = 50000 images, grid of dimensions 192 × 192 × 192, and λ = 10 −8 .(a) Lower bound of κ(H) (blue, solid) computed using (21), the value of κ(H) for the given dataset without CTF (red, dashed) and with CTF (green, dashdot), and the value of 1/λ (black, dotted).The condition numbers computed for this plot are for H with nearest-neighbor interpolation.(b) The z = 0 plane of the diagonal of the trilinear Hessian matrix H for the given dataset, reshaped as a 3D volume.Note the effect of the CTF, in contrast to Figure 1(c), which does not have a CTF.

Figure 4 :
Figure 4: Results of the numerical experiments with trilinear interpolation projection matrices P tri ϕ for N = 50000 images of 192 × 192 pixels.Panel (a) shows the loss function value and panel (b)shows the FSC of the final reconstruction with a ground truth for SGD with no preconditioner (blue, solid), SGD with a precomputed preconditioner (red, solid), and Algorithm 1, namely SGD with an estimated and thresholded preconditioner (green, solid).The green dashed line shows the convergence and FSC for SGD with an estimated but not thresholded preconditioner.In (c), the diagonal of the inverse of the (diagonal) preconditioning matrix D is reshaped into a 192 × 192 × 192 volume and averages along radial frequency shells are computed (blue), as well as the variance of the gradient computed over one epoch (orange), shown without thresholding (dashed) and with thresholding (solid).

Figure 5 :
Figure5: FSC for individual Fourier shells across epochs for a low-frequency shell (a), a middlefrequency shell (b) and a high-frequency shell (c).While preconditioning is not required at low resolution, as seen in (a), its effect becomes increasingly apparent at higher resolution.Note that, since in these experiments the projection matrices use trilinear interpolation, the Hessian matrix H is not diagonal, and yet the diagonal preconditioner is very effective.

Figure 6 : 8 Figure 7 :
Figure 6: FSC for individual Fourier shells as a function of the Fourier shell index and epoch number for SGD with no preconditioner (a), SGD with a precomputed preconditioner (b) and SGD with an estimated and thresholded preconditioner (c), with projection implemented using trilinear interpolation.